Turing Test


The Year in Slop

The New Yorker

This was the year that A.I.-generated content passed a kind of audiovisual Turing test, sometimes fooling us against our better judgment. The Turing test, a long-established tool for measuring machine intelligence, gauges the point at which a text-generating machine can fool a human into thinking it's not a robot. ChatGPT passed that benchmark earlier this year, inaugurating a new technological era, though not necessarily one of superhuman intelligence. More recently, however, artificial intelligence passed another threshold, a kind of Turing test for the eye: the images and videos that A.I. can produce are now sometimes indistinguishable from real ones. As new, image-friendly models were trained, refined, and released by companies including OpenAI, Meta, and Google, the online public gained the ability to instantly generate realistic A.I. content on any theme they could imagine, from superhero fan art and cute animals to scenes of violence and war.


In Defense of the Turing Test and its Legacy

Gonçalves, Bernardo

arXiv.org Artificial Intelligence

This article argues that Turing's original test was co-opted by Weizenbaum, and that six of the most common criticisms of the Turing test are unfair both to Turing's argument and to the historical development of AI. The Turing test has faced criticism for decades, most recently at the Royal Society event "Celebrating the 75th Anniversary of the Turing Test." The question of the Turing test's significance has intensified with recent advances in large language model technology, which now enable machines to pass it. In this article, I address six of the most common criticisms of the Turing test: (1) the Turing test encourages fooling people; (2) Turing overestimated human intelligence, as people can be easily fooled (the ELIZA effect); (3) the Turing test is not a good benchmark for AI; (4) Turing's 1950 paper is not serious and/or has contradictions; (5) imitation should not be a goal for AI, and it is also harmful to society; (6) passing the Turing test teaches nothing about AI. All six criticisms largely derive from Joseph Weizenbaum's influential reinterpretation of the Turing test. The first four fail to withstand a close examination of the internal logic of Turing's 1950 paper, particularly when the paper is situated within its mid-twentieth-century context.


Generalizing GANs: A Turing Perspective

Roderich Gross, Yue Gu, Wei Li, Melvin Gauci

Neural Information Processing Systems

They place two neural networks--a model and a discriminator--in a competitive setting. The discriminator's objective is to correctly label samples from either the model or the training data. The model's objective is to deceive the discriminator, in other words, to produce samples that the discriminator cannot distinguish from the training data.
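The competitive setting the excerpt describes can be sketched numerically. Below is a minimal illustration of the standard GAN losses; the helper names `d_loss` and `g_loss` are my own shorthand, and the paper's Turing-perspective generalization is not reproduced here.

```python
import math

def d_loss(d_real: float, d_fake: float) -> float:
    """Discriminator loss: push D(real data) toward 1 and D(model sample) toward 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake: float) -> float:
    """Model (generator) loss: deceive the discriminator, i.e. push D(sample) toward 1."""
    return -math.log(d_fake)

# At the game's equilibrium the discriminator is maximally confused and
# outputs 0.5 on every sample:
print(round(d_loss(0.5, 0.5), 4))  # 1.3863 (= 2 ln 2)
print(round(g_loss(0.5), 4))       # 0.6931 (= ln 2)
```

The equilibrium value 2 ln 2 is the textbook signature of the two objectives balancing: neither player can improve by deviating.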


Normality and the Turing Test

Kabbach, Alexandre

arXiv.org Artificial Intelligence

This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the Turing test is a test of normal intelligence as assessed by a normal judge. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires machines to "make mistakes" and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single "average" judge (understood as non-expert) but always by a full jury. As such, the notion of "average human interrogator" that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test, as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence, insofar as they deviate from Turing's original goal of modeling artificial minds. Second, it argues that the objectivization of normal human behavior in the Turing test fails due to the game configuration of the test, which ends up objectivizing normative ideals of normal behavior rather than normal behavior per se.


ElevenLabs CEO Mati Staniszewski on Darth Vader, Competition and Preventing Misuse

TIME - Tech

Pillay is an editorial fellow at TIME. What is the split between your individual and enterprise customers? It was [previously] lower on the enterprise side. At the beginning of 2024, it was 90/10.


Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation

Arita, Takaya, Zheng, Wenxian, Suzuki, Reiji, Akiba, Fuminori

arXiv.org Artificial Intelligence

This study explored how large language models (LLMs) perform in two areas related to art: writing critiques of artworks and reasoning about mental states (Theory of Mind, or ToM) in art-related situations. For the critique generation part, we built a system that combines Noel Carroll's evaluative framework with a broad selection of art criticism theories. The model was prompted to first write a full-length critique and then shorter, more coherent versions using a step-by-step prompting process. These AI-generated critiques were then compared with those written by human experts in a Turing test-style evaluation. In many cases, human subjects had difficulty telling which was which, and the results suggest that LLMs can produce critiques that are not only plausible in style but also rich in interpretation, as long as they are carefully guided. In the second part, we introduced new simple ToM tasks based on situations involving interpretation, emotion, and moral tension, which can appear in the context of art. These go beyond standard false-belief tests and allow for more complex, socially embedded forms of reasoning. We tested 41 recent LLMs and found that their performance varied across tasks and models. In particular, tasks that involved affective or ambiguous situations tended to reveal clearer differences. Taken together, these results help clarify how LLMs respond to complex interpretative challenges, revealing both their cognitive limitations and potential. While our findings do not directly contradict the so-called Generative AI Paradox--the idea that LLMs can produce expert-like output without genuine understanding--they suggest that, depending on how LLMs are instructed, such as through carefully designed prompts, these models may begin to show behaviors that resemble understanding more closely than we might assume.


ChatGPT passed the Turing Test. Now what?

Popular Science

ChatGPT passed the Turing Test. The AI fooled 73% of people into thinking it was human, raising new questions about machine intelligence. As artificial intelligence gets better and better, people face machines that look--and act--surprisingly human. It seems that every day brings a new headline about the burgeoning capabilities of large language models (LLMs) like ChatGPT and Google's Gemini--headlines that are either exciting or increasingly apocalyptic, depending on one's point of view. One particularly striking story arrived earlier this year: a paper that described how an LLM had passed the Turing Test, an experiment devised in the 1950s by computer science pioneer Alan Turing to determine whether machine intelligence could be distinguished from that of a human. The LLM in question was ChatGPT 4.5, and the paper found that it had been strikingly successful in fooling people into thinking it was human: in an experiment where participants were asked to decide whether the chatbot or an actual human was the real person, nearly three out of four chose the chatbot.


Moravec's Paradox: Towards an Auditory Turing Test

Noever, David, McKee, Forrest

arXiv.org Artificial Intelligence

This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec's paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4's audio capabilities and OpenAI's Whisper reveals a striking failure rate exceeding 93%, with even the best-performing model achieving only 6.9% accuracy on tasks that humans solved at 7.5 times higher success (52%). These results expose focusing failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insights into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs highlights common filters that humans evolved but machines fail to select in multimodal language models. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches integrating selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems. Artificial intelligence has made great strides in language understanding and multimodal perception, yet machines still struggle with basic auditory tasks that humans perform successfully [1-20]. A striking example is the cocktail party effect [21-22], the human ability to focus on a single conversation in a noisy room, which remains a formidable challenge for AI.


Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI

Messina, Alberto

arXiv.org Artificial Intelligence

In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the "dual Turing test", in which a human judge's goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality-related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge's task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary's feasible strategy set M. Next, we map this minimax game onto an RL-HF style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component's notation, the meaning of the inner minimization over sequences, phased tests, and iterative adversarial training, and conclude with suggestions for a couple of immediate actions.
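The judging loop the abstract formalizes can be caricatured in a few lines of code. This is only a rough sketch under my own assumptions: the names `judge`, `quality`, and `make_pair`, and the scoring rule, are illustrative placeholders, not the paper's definitions.

```python
def dual_test(rounds, judge, quality, tau, make_pair):
    """Run independent rounds; return the judge's AI-identification rate.

    Each round yields an (ai_output, human_output) pair. AI outputs whose
    quality falls below the threshold tau are disqualified, i.e. counted
    as trivially detected."""
    correct = 0
    for _ in range(rounds):
        ai_out, human_out = make_pair()
        if quality(ai_out) < tau:
            correct += 1  # low-quality output gives itself away
            continue
        # the judge must flag the AI output and clear the human one
        if judge(ai_out) and not judge(human_out):
            correct += 1
    return correct / rounds

# A judge keyed to an obvious tell identifies the AI in every round:
rate = dual_test(5, lambda s: s.startswith("AI"), len, 1,
                 lambda: ("AI text", "human text"))
print(rate)  # 1.0
```

Counting sub-threshold outputs as detected mirrors the idea that the adversary's feasible strategy set is constrained by the quality threshold: stealth is only meaningful among outputs fluent enough to pass muster.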


AI or Human? Understanding Perceptions of Embodied Robots with LLMs

Hriscu, Lavinia, Sanfeliu, Alberto, Garrell, Anais

arXiv.org Artificial Intelligence

The pursuit of artificial intelligence has long been associated with the challenge of effectively measuring intelligence. Although the Turing Test was introduced as a means of assessing a system's intelligence, its relevance and application within the field of human-robot interaction remain largely underexplored. This study investigates the perception of intelligence in embodied robots by performing a Turing Test on a robotic platform. A total of 34 participants were tasked with distinguishing between AI- and human-operated robots while engaging in two interactive tasks: an information retrieval task and a package handover. These tasks assessed the robot's perception and navigation abilities under both static and dynamic conditions. Results indicate that participants were unable to reliably differentiate between AI- and human-controlled robots beyond chance levels. Furthermore, analysis of participant responses reveals key factors influencing the perception of artificial versus human intelligence in embodied robotic systems. These findings provide insights into the design of future interactive robots and contribute to the ongoing discourse on intelligence assessment in AI-driven systems.